We formulate weighted graph clustering as a prediction problem$^1$: given a subset of edge weights
we analyze the ability of graph clustering to predict the remaining edge weights. This formulation
enables practical and theoretical comparison of different approaches to graph clustering as well as
comparison of graph clustering with other possible ways to model the graph. We adapt the
PAC-Bayesian analysis of co-clustering (Seldin and Tishby, 2008; Seldin, 2009) to derive a
PAC-Bayesian generalization bound for graph clustering. The bound shows that graph clustering
should optimize a trade-off between empirical data fit and the mutual information that clusters
preserve on the graph nodes. A similar trade-off derived from information-theoretic considerations
was already shown to produce state-of-the-art results in practice (Slonim et al., 2005; Yom-Tov and
Slonim, 2009). This paper supports the empirical evidence by providing a better theoretical
foundation, suggesting formal generalization guarantees, and offering a more accurate way to deal
with finite sample issues. We derive a bound minimization algorithm and show that it provides good
results in real-life problems and that the derived PAC-Bayesian bound is reasonably tight.

1 Introduction

Graph clustering is an important tool in data analysis with a wide variety of applications,
including social network analysis, bioinformatics, image processing, and many more. As a result, a
multitude of different approaches to graph clustering have been developed. Examples include graph
cut methods (Shi and Malik, 2000), spectral clustering (Ng et al., 2001), and information-theoretic
approaches (Slonim et al., 2005), to name just a few. Comparing the different approaches is usually
a painful task, mainly because the goal of each of these clustering methods is formulated in terms
of the solution: most clustering methods start by defining some objective functional and then
minimizing it. But for a given problem, how can we choose whether to apply a graph cut method,
spectral clustering, or an information-theoretic approach?

In this paper we formulate weighted graph clustering as a prediction problem.$^2$ Given a subset of
edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The
rationale behind this formulation is that if a model (not necessarily cluster-based) is able to
predict with high precision all edge weights of a graph given a small subset of edge weights, then
it is a good model of the graph. The advantage of this formulation of graph modeling is that it is
independent of the specific way chosen to model the graph and can be used to compare any two
solutions, either by comparison of generalization bounds or by cross-validation. The generalization
bound or cross-validation also addresses the finite-sample nature of the graph clustering problem
and provides a clear criterion for model order selection. For very large datasets, where
computational constraints can prevent considering all edges of a graph, as for example in (Yom-Tov
and Slonim, 2009), the generalization bound can be used to resolve the trade-off between
computational workload and precision of graph modeling.

1 Pairwise clustering is equivalent to clustering of a weighted graph, where edge weights correspond to
pairwise distances. Hence, from this point on, we restrict the discussion to graph clustering.

2 Unweighted graphs can be modeled by setting the weight of present edges as 1 and absent edges as 0.

The formulation and analysis of graph clustering presented here are based on the analysis of
co-clustering suggested in (Seldin and Tishby, 2008; Seldin, 2009), which is reviewed briefly in
section 2. In section 3 we adapt the analysis to derive a PAC-Bayesian generalization bound for the
graph clustering problem. The generalization bound depends on a trade-off between empirical fit of
the cluster structure to the graph and the amount of mutual information that the clusters preserve
on the graph nodes. This trade-off is related to the objective of a successful graph clustering
algorithm, Iclust (Slonim et al., 2005). We discuss this relation in section 4. In section 5 we
suggest an algorithm for minimization of our bound and, finally, in section 6 we present some
experiments with real-world data and analyze the tightness of the bound.



2 Review of PAC-Bayesian Analysis of Co-clustering 

Co-clustering is a widely used method for analysis of data in the form of a matrix by simultaneous
clustering of rows and columns of the matrix (Banerjee et al., 2007). A good illustrative example of
a co-clustering problem is collaborative filtering (Herlocker et al., 2004). In collaborative
filtering one is given a matrix of viewers by movies with ratings given by the viewers to the
movies. The matrix is usually sparse and the task is to predict the missing entries. We assume that
there is an unknown probability distribution $p(X_1, X_2, Y)$ over the triplets of viewer $X_1$,
movie $X_2$, and rating $Y$. The goal is to build a discriminative predictor $q(Y|X_1, X_2)$ that,
given a viewer and movie pair, will predict the expected rating $Y$. A natural form of evaluation of
such predictors, no matter whether they are based on co-clustering or not, is to evaluate the
expected loss $E_{p(X_1,X_2,Y)} E_{q(Y'|X_1,X_2)} l(Y, Y')$, where $l(Y, Y')$ is an externally
provided loss function for predicting $Y'$ instead of $Y$.



2.1 PAC-Bayesian Analysis of Discriminative Prediction with Co-clustering 

Let $\mathcal{X}_1 \times .. \times \mathcal{X}_d \times \mathcal{Y}$ be a $(d+1)$-dimensional
product space and assume that each $\mathcal{X}_i$ is categorical and its cardinality
$|\mathcal{X}_i|$ is fixed and known. We also assume that $\mathcal{Y}$ is finite with cardinality
$|\mathcal{Y}|$ and that the loss function $l(Y, Y')$ is bounded. In the collaborative filtering
example $\mathcal{X}_1$ is the space of viewers, $\mathcal{X}_2$ is the space of movies, $d = 2$,
and $\mathcal{Y}$ is the space of ratings (e.g., on a five-star scale). The loss $l(Y, Y')$ can be,
for example, an absolute loss $l(Y, Y') = |Y - Y'|$ or a quadratic loss $l(Y, Y') = (Y - Y')^2$.

We assume the existence of an unknown probability distribution $p(X_1, .., X_d, Y)$ over
$\mathcal{X}_1 \times .. \times \mathcal{X}_d \times \mathcal{Y}$ and that a training sample of size
$N$ is generated i.i.d. according to $p$. We use $\hat{p}(X_1, .., X_d, Y)$ to denote the empirical
frequencies of $(d+1)$-tuples $(X_1, .., X_d, Y)$ in the sample. We consider the following form of
discriminative predictors:

$$q(Y|X_1, .., X_d) = \sum_{C_1, .., C_d} q(Y|C_1, .., C_d) \prod_{i=1}^{d} q(C_i|X_i). \tag{1}$$

The hidden variables $C_1, .., C_d$ represent a clustering of $X_1, .., X_d$. The hidden variable
$C_i$ accepts values in $\{1, .., |C_i|\}$, where $|C_i|$ is the number of clusters used along
dimension $i$. The free parameters of the model (1) are the conditional probability distributions
$q(C_i|X_i)$, which represent the probability of assigning $X_i$ to cluster $C_i$, and the
conditional probability $q(Y|C_1, .., C_d)$, which represents the probability of assigning label $Y$
to cell $(C_1, .., C_d)$ in the cluster product space. We denote the free parameters collectively by
$Q = \{\{q(C_i|X_i)\}_{i=1}^{d}, q(Y|C_1, .., C_d)\}$. We define the expected and empirical losses
$L(Q)$ and $\hat{L}(Q)$ of the prediction strategy defined by $Q$ as:

$$L(Q) = E_{p(X_1, .., X_d, Y)} E_{q(Y'|X_1, .., X_d)} l(Y, Y'), \tag{2}$$

$$\hat{L}(Q) = E_{\hat{p}(X_1, .., X_d, Y)} E_{q(Y'|X_1, .., X_d)} l(Y, Y'), \tag{3}$$

where $q(Y'|X_1, .., X_d)$ is defined by (1). We define the mutual information $I(X_i; C_i)$
corresponding to the joint distribution $q(X_i, C_i) = \frac{1}{|\mathcal{X}_i|} q(C_i|X_i)$ defined
by $q(C_i|X_i)$ and a uniform distribution over $\mathcal{X}_i$ as:

$$I(X_i; C_i) = \frac{1}{|\mathcal{X}_i|} \sum_{x_i, c_i} q(c_i|x_i) \ln \frac{q(c_i|x_i)}{q(c_i)}, \tag{4}$$

where $q(c_i) = \frac{1}{|\mathcal{X}_i|} \sum_{x_i} q(c_i|x_i)$ is the marginal distribution over
$C_i$. Finally, we denote the KL-divergence between two Bernoulli distributions with biases
$\hat{L}(Q)$ and $L(Q)$ by

$$kl(\hat{L}(Q) \| L(Q)) = \hat{L}(Q) \ln \frac{\hat{L}(Q)}{L(Q)} + (1 - \hat{L}(Q)) \ln \frac{1 - \hat{L}(Q)}{1 - L(Q)}. \tag{5}$$
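As a minimal sketch (not part of the original analysis), the mutual information between nodes and clusters under a uniform distribution over the nodes can be computed directly from a table of $q(c|x)$. The function name and the list-of-rows layout are assumptions made for illustration.

```python
import math

def mutual_information(q_c_given_x):
    """I(X;C) in nats, assuming a uniform distribution over the nodes.

    q_c_given_x[x][c] holds q(c|x); each row is expected to sum to 1.
    """
    n_nodes = len(q_c_given_x)
    n_clusters = len(q_c_given_x[0])
    # Marginal q(c) = (1/|X|) * sum_x q(c|x).
    q_c = [sum(row[c] for row in q_c_given_x) / n_nodes
           for c in range(n_clusters)]
    total = 0.0
    for row in q_c_given_x:
        for c, p in enumerate(row):
            if p > 0.0:  # the term 0 * ln 0 is taken to be 0
                total += p * math.log(p / q_c[c])
    return total / n_nodes

# Hard assignment of four nodes to two equally sized clusters.
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Every node spread evenly over both clusters: nothing is preserved.
soft = [[0.5, 0.5] for _ in range(4)]
```

For the hard assignment the function returns $\ln 2$ nats, and for the fully soft assignment it returns 0, which are the two extremes of the complexity term for two clusters.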

The following generalization bound for discriminative prediction with co-clustering was proved in
(Seldin, 2009).

Theorem 1. For any probability measure $p(X_1, .., X_d, Y)$ over
$\mathcal{X}_1 \times .. \times \mathcal{X}_d \times \mathcal{Y}$ and for any loss function $l$
bounded by 1, with a probability of at least $1 - \delta$ over a selection of an i.i.d. sample $S$ of
size $N$ according to $p$, for all randomized classifiers
$Q = \{\{q(C_i|X_i)\}_{i=1}^{d}, q(Y|C_1, .., C_d)\}$:

$$kl(\hat{L}(Q) \| L(Q)) \le \frac{\sum_{i=1}^{d} \left( |\mathcal{X}_i| I(X_i; C_i) + |C_i| \ln |\mathcal{X}_i| \right) + \left( \prod_{i=1}^{d} |C_i| \right) \ln |\mathcal{Y}| + \frac{1}{2} \ln(4N) - \ln \delta}{N}. \tag{6}$$

In practice Seldin (2009) replaces (6) with a parameterized trade-off

$$\mathcal{F}(Q) = \beta N \hat{L}(Q) + \sum_{i=1}^{d} |\mathcal{X}_i| I(X_i; C_i) \tag{7}$$

and suggests an alternating projection algorithm for finding a local minimum of $\mathcal{F}(Q)$
(for a fixed $\beta$). Bound (6) is minimized by applying a linear search over $\beta$ and
substituting $\hat{L}(Q)$ and $I(X_i; C_i)$ obtained from optimization of $\mathcal{F}(Q)$ back into
(6). Alternatively, the value of $\beta$ can be tuned by cross-validation. This algorithm achieved
state-of-the-art performance on the MovieLens collaborative filtering dataset. Below we adapt this
analysis and algorithm to the graph clustering problem.

3 Formulation and Analysis of Graph Clustering 
3.1 Graph Clustering as a Prediction Problem 

Assume that $\mathcal{X}$ is a space of $|\mathcal{X}|$ nodes and denote by $w_{ij}$ the weight of
an edge connecting nodes $i$ and $j$.$^3$ We assume that the weights $w_{ij}$ are generated
according to an unknown probability distribution $p(W|X_1, X_2)$, where
$X_1, X_2 \in \mathcal{X}$ are the edge endpoints. We further assume that we know the space of nodes
$\mathcal{X}$ and are given a sample of size $N$ of edge weights, generated according to
$p(X_1, X_2, W)$. The goal is to build a regression function $q(W|X_1, X_2)$ that will minimize the
expected prediction error of the edge weights $E_{p(X_1,X_2,W)} E_{q(W'|X_1,X_2)} l(W, W')$ for some
externally given loss function $l(W, W')$. Note that this formulation does not assume any specific
form of $q(W|X_1, X_2)$ and enables comparison of all possible approaches to this problem.
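To illustrate how this formulation puts arbitrary models on the same footing, the sketch below scores two predictors by their average loss on held-out edges. The toy edge lists, the grouping of nodes, and all function names are hypothetical and chosen only for illustration.

```python
def mean_loss(predict, edges, loss):
    """Average loss of predict(i, j) over (i, j, w) triples."""
    return sum(loss(w, predict(i, j)) for i, j, w in edges) / len(edges)

def squared_loss(w, w_pred):
    return (w - w_pred) ** 2

# Toy data (hypothetical): two tightly connected groups {0, 1} and {2, 3}.
train = [(0, 1, 0.9), (2, 3, 0.8), (0, 2, 0.1), (1, 3, 0.2)]
test = [(1, 0, 0.9), (3, 2, 0.8), (0, 3, 0.1), (1, 2, 0.2)]

# Baseline model: always predict the global average of the observed weights.
global_avg = sum(w for _, _, w in train) / len(train)

def baseline(i, j):
    return global_avg

# Block model: predict the training average for "same group" or "different group".
group = {0: 0, 1: 0, 2: 1, 3: 1}

def block_model(i, j):
    same = group[i] == group[j]
    vals = [w for a, b, w in train if (group[a] == group[b]) == same]
    return sum(vals) / len(vals)

baseline_loss = mean_loss(baseline, test, squared_loss)
block_loss = mean_loss(block_model, test, squared_loss)
```

Any other model, cluster-based or not, can be plugged in as `predict` and compared on the same held-out edges.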



3.2 PAC-Bayesian Analysis of Graph Clustering 

In this work we analyze the generalization abilities of $q(W|X_1, X_2)$ based on clustering:

$$q(W|X_1, X_2) = \sum_{C_1, C_2} q(W|C_1, C_2) q(C_1|X_1) q(C_2|X_2). \tag{8}$$

One can immediately see the relation between (8) and (1). The only difference is that in (8) the
nodes $X_1, X_2$ belong to the same space of nodes $\mathcal{X}$ and the conditional distribution
$q(C|X)$ is shared for the mapping of endpoints of an edge. Let $\hat{p}(X_1, X_2, W)$ be the
empirical distribution over edge weights. The empirical loss of a prediction strategy
$Q = \{q(C|X), q(W|C_1, C_2)\}$ corresponding to (8) can then be written as:

$$\hat{L}(Q) = E_{\hat{p}(X_1,X_2,W)} E_{q(W'|X_1,X_2)} l(W, W'). \tag{9}$$

3 All the results can be straightforwardly extended to hyper-graphs.
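The cluster-based predictor and its empirical loss can be sketched as follows for a finite set of weights. The data layout (a list of rows for $q(c|x)$, a dictionary of weight probabilities per cluster pair) and the toy numbers are assumptions made for this illustration.

```python
def predictive_distribution(q_c_given_x, q_w_given_cc, x1, x2):
    """q(W|x1,x2) = sum_{c1,c2} q(W|c1,c2) q(c1|x1) q(c2|x2).

    q_c_given_x[x][c] = q(c|x) is shared by both endpoints of an edge;
    q_w_given_cc[c1][c2] is a dict {weight: probability}.
    """
    dist = {}
    for c1, p1 in enumerate(q_c_given_x[x1]):
        for c2, p2 in enumerate(q_c_given_x[x2]):
            for w, pw in q_w_given_cc[c1][c2].items():
                dist[w] = dist.get(w, 0.0) + pw * p1 * p2
    return dist

def empirical_loss(sample, q_c_given_x, q_w_given_cc, loss):
    """Average over the sample of the expected loss E_{q(W'|x1,x2)} l(w, W')."""
    total = 0.0
    for x1, x2, w in sample:
        dist = predictive_distribution(q_c_given_x, q_w_given_cc, x1, x2)
        total += sum(p * loss(w, w_pred) for w_pred, p in dist.items())
    return total / len(sample)

# Hypothetical toy model: 3 nodes, 2 clusters, binary weights.
q_c_given_x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
q_w_given_cc = [
    [{1: 0.9, 0: 0.1}, {1: 0.2, 0: 0.8}],
    [{1: 0.2, 0: 0.8}, {1: 0.9, 0: 0.1}],
]
sample = [(0, 1, 1), (0, 2, 0), (1, 2, 1)]
dist_01 = predictive_distribution(q_c_given_x, q_w_given_cc, 0, 1)
loss_hat = empirical_loss(sample, q_c_given_x, q_w_given_cc,
                          lambda w, w_pred: abs(w - w_pred))
```

Node 1 is split evenly between the two clusters, so its predicted edge distribution to node 0 is an even mixture of the within-cluster and between-cluster weight distributions.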

The following generalization bound for graph clustering can be proved by a minor adaptation of the
proof of theorem 1.

Theorem 2. For any probability measure $p(X_1, X_2, W)$ over the space of nodes and edge weights
$\mathcal{X} \times \mathcal{X} \times \mathcal{W}$ and for any loss function $l$ bounded by 1, with
a probability of at least $1 - \delta$ over a selection of an i.i.d. sample $S$ of size $N$
according to $p$, for all graph clustering models defined by $Q = \{q(C|X), q(W|C_1, C_2)\}$:

$$kl(\hat{L}(Q) \| L(Q)) \le \frac{|\mathcal{X}| I(X; C) + |C| \ln |\mathcal{X}| + |C|^2 \ln |\mathcal{W}| + \frac{1}{2} \ln(4N) - \ln \delta}{N}, \tag{10}$$

where $|C|$ is the number of node clusters and $|\mathcal{W}|$ is the number of distinct edge weights.
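The right-hand side of the bound in Theorem 2 is straightforward to evaluate. The sketch below is illustrative only: the function and argument names are assumptions, and the numbers in the example are hypothetical rather than taken from the experiments.

```python
import math

def bound_rhs(n_nodes, n_clusters, n_weights, info, n_samples, delta):
    """Right-hand side of the bound in Theorem 2; info is I(X;C) in nats."""
    numerator = (n_nodes * info
                 + n_clusters * math.log(n_nodes)
                 + n_clusters ** 2 * math.log(n_weights)
                 + 0.5 * math.log(4 * n_samples)
                 - math.log(delta))
    return numerator / n_samples

# Hypothetical setting: 1,000 nodes, 5 clusters, binary weights,
# I(X;C) = 1 nat, 100,000 observed edges, confidence 95%.
rhs = bound_rhs(1000, 5, 2, 1.0, 100_000, 0.05)
```

In this setting the $|\mathcal{X}| I(X;C)$ term dominates the numerator, which is why the trade-off between empirical loss and mutual information is the quantity worth optimizing.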

The limitation of working with a fixed set of allowed edge weights is resolved by weight quantization
in section 5.1.

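Although there is no analytical expression for the inverse KL-divergence, given (10) we can easily
bound $L(Q)$ numerically:

$$L(Q) \le kl^{-1}\left( \hat{L}(Q), \frac{|\mathcal{X}| I(X; C) + |C| \ln |\mathcal{X}| + |C|^2 \ln |\mathcal{W}| + \frac{1}{2} \ln(4N) - \ln \delta}{N} \right)$$

$$:= \max \left\{ z : kl(\hat{L}(Q) \| z) \le \frac{|\mathcal{X}| I(X; C) + |C| \ln |\mathcal{X}| + |C|^2 \ln |\mathcal{W}| + \frac{1}{2} \ln(4N) - \ln \delta}{N} \right\}. \tag{11}$$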
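One simple way to perform this numerical inversion is bisection, since $kl(\hat{L} \| z)$ is increasing in $z$ for $z \ge \hat{L}$. The sketch below is one possible implementation, not necessarily the procedure used by the authors; the clipping constant and tolerance are arbitrary choices.

```python
import math

def binary_kl(p_hat, p):
    """KL-divergence between Bernoulli distributions with biases p_hat and p."""
    eps = 1e-12  # clip both arguments to avoid log(0)
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return (p_hat * math.log(p_hat / p)
            + (1.0 - p_hat) * math.log((1.0 - p_hat) / (1.0 - p)))

def kl_inverse(loss_hat, rhs, tol=1e-10):
    """Largest z in [loss_hat, 1] with kl(loss_hat || z) <= rhs, by bisection."""
    lo, hi = loss_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(loss_hat, mid) <= rhs:
            lo = mid  # mid still satisfies the constraint
        else:
            hi = mid
    return lo

upper = kl_inverse(0.1, 0.05)  # bound on L(Q) for empirical loss 0.1
```

When the right-hand side is 0 the inverse returns the empirical loss itself, and it grows toward 1 as the complexity term grows.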



Similar to the approach applied by Seldin (2009) in co-clustering, in practice we can replace (10)
with a parameterized trade-off:

$$\mathcal{G}(Q) = \beta N \hat{L}(Q) + |\mathcal{X}| I(X; C) \tag{12}$$

and tune $\beta$ either by substituting $\hat{L}(Q)$ and $I(X; C)$ resulting from a solution of (12)
back into (11) or via cross-validation. In section 5 we suggest an algorithm for minimization of (12).

4 Related Work 

The regularization of pairwise clustering by mutual information $I(X; C)$ was already applied in
practice by Slonim et al. (2005). In their work they maximized a parameterized trade-off
$\langle s \rangle - T I(X; C)$, where
$\langle s \rangle = \sum_{c} q(c) \sum_{x_1, x_2} q(x_1|c) q(x_2|c) w_{x_1 x_2}$ measured average
pairwise similarities within a cluster.$^4$ Their algorithm demonstrated superior results in cluster
coherence compared to 18 other clustering methods. The regularization by mutual information was
motivated by information-theoretic considerations inspired by rate distortion theory (Cover and
Thomas, 1991). Namely, the authors drew a parallel between $\langle s \rangle$ and distortion and
between $I(X; C)$ and the compression rate of a clustering algorithm. Further, Yom-Tov and Slonim
(2009) showed that the algorithm can be run in parallel mode, where each parallel worker operates
with a subset of pairwise relations at each iteration rather than all of them. Such a mode of
operation was motivated by the inability to consider all pairwise relations in very large datasets
due to computational constraints. Yom-Tov and Slonim (2009) reported only minor empirical
degradation in clustering quality, but no formal analysis or guarantees were suggested.

In light of this prior work the main contribution of our paper is not so much the introduction of
the trade-off $\mathcal{G}(Q)$ in equation (12), but rather the formulation of graph clustering as a
prediction problem and the analysis of the finite sample aspect of this problem. The experiments
that follow focus on the analysis of tightness of the bound derived in section 3.

4 The loss $\hat{L}(Q)$ is slightly more general than $\langle s \rangle$ since it also considers edges between the clusters.

5 An Algorithm for Graph Clustering



In this section we derive an algorithm for minimization of the trade-off $\mathcal{G}(Q)$. Unlike
the co-clustering trade-off $\mathcal{F}(Q)$ in equation (7), which is convex in $q(C_1|X_1)$ and
$q(C_2|X_2)$ and thus can be minimized by alternating projections, the trade-off $\mathcal{G}(Q)$ is
not convex in $q(C|X)$. Nevertheless, we found in our experiments that alternating projections still
provide good outcomes in practice. Alternatively, one can apply sequential minimization techniques,
as done by Yom-Tov and Slonim (2009). The alternating projections are much faster, though, and for
that reason were chosen for the experiments.

The alternating projections are derived similarly to alternating projection minimization in rate
distortion theory (Cover and Thomas, 1991), namely by writing the Lagrangian corresponding to
$\mathcal{G}(Q)$, differentiating it with respect to the free parameters, and equating the
derivative to zero. This procedure provides a set of self-consistent equations, which are exactly
the same as those for alternating projection of $\mathcal{F}(Q)$, hence we write the result in the
Algorithm 1 box and refer the reader to (Seldin, 2009) for derivation details. The only difference
in our case is in the form of the derivative $\frac{\partial \hat{L}(Q)}{\partial q(c|x)}$, which we
derive next.


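Algorithm 1 One iteration of an alternating projection of $\mathcal{G}(Q) = \beta N \hat{L}(Q) + |\mathcal{X}| I(X; C)$.
Input: $\hat{p}(x_1, x_2, w)$, $q_t(C|X)$, $g_t(c_1, c_2)$, $N$, $|\mathcal{X}|$, $|C|$, $l(w, w')$, $\beta$.

$q_{t+1}(c|x) \leftarrow q_t(c) e^{-\beta N \frac{\partial \hat{L}(Q_t)}{\partial q(c|x)}}$
$Z_{t+1}(x) \leftarrow \sum_{c} q_{t+1}(c|x)$
$q_{t+1}(c|x) \leftarrow \frac{q_{t+1}(c|x)}{Z_{t+1}(x)}$
$g_{t+1}(c_1, c_2) \leftarrow \arg\min_{w'} \sum_{w} l(w, w') \sum_{x_1, x_2} q_{t+1}(c_1|x_1) \hat{p}(x_1, x_2, w) q_{t+1}(c_2|x_2)$
return $q_{t+1}(C|X)$, $g_{t+1}(C_1, C_2)$.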
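The update of $q(c|x)$ in Algorithm 1 can be sketched as below for a given matrix of loss derivatives. The exponent $-\beta N \, \partial \hat{L} / \partial q(c|x)$ is an assumption: it is what setting the derivative of the Lagrangian of $\mathcal{G}(Q)$ to zero gives under a uniform distribution over nodes, and should be checked against (Seldin, 2009). The function name and the rows-by-clusters layout are also illustrative choices.

```python
import math

def update_assignments(q, grad, beta, n_samples):
    """One multiplicative update of q(c|x) followed by normalization.

    q[c][x] = q_t(c|x) and grad[c][x] = dL_hat/dq(c|x), with rows indexed
    by clusters and columns by nodes. The exponent -beta * N * grad is an
    assumption of this sketch (see the accompanying text).
    """
    n_clusters, n_nodes = len(q), len(q[0])
    # Cluster marginal q_t(c) under a uniform distribution over nodes.
    q_c = [sum(q[c]) / n_nodes for c in range(n_clusters)]
    new_q = [[0.0] * n_nodes for _ in range(n_clusters)]
    for x in range(n_nodes):
        # Log of the unnormalized update q_t(c) * exp(-beta * N * grad[c][x]).
        logits = []
        for c in range(n_clusters):
            if q_c[c] > 0.0:
                logits.append(math.log(q_c[c]) - beta * n_samples * grad[c][x])
            else:
                logits.append(-math.inf)  # empty clusters stay empty
        top = max(logits)  # subtracted for numerical stability
        unnorm = [math.exp(v - top) for v in logits]
        z = sum(unnorm)  # Z_{t+1}(x), up to the common factor exp(top)
        for c in range(n_clusters):
            new_q[c][x] = unnorm[c] / z
    return new_q

# Hypothetical inputs: 2 clusters, 3 nodes, uniform initial assignment.
q0 = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
grad0 = [[0.1, 0.0, -0.1], [-0.1, 0.0, 0.1]]
q1 = update_assignments(q0, grad0, beta=1.0, n_samples=10)
```

Each node moves toward the cluster whose assignment probability has the more negative loss derivative, and a node with equal derivatives keeps its uniform assignment.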



For notational convenience we reformulate the problem in matrix notation. For simplicity we assume
that the edge weights $w$ are sampled without repetition. This assumption usually holds in practice
and it also does not affect the tightness of the analysis, since the convergence rate of sampling
without repetition is lower bounded by the convergence rate of sampling with repetition (Derbeko et
al., 2004). With this assumption we can represent the training data by the Hadamard (also known as
Schur) entrywise matrix product $S \circ W$ (denoted by S .* W in Matlab), where $S_{ij} = 1$ if the
edge from node $i$ to node $j$ was observed in the sample and $S_{ij} = 0$ otherwise, and
$W_{ij} = w_{ij}$. In order to obtain the derivative
$\frac{\partial \hat{L}(Q)}{\partial q(c|x)}$ we have to assume a specific form of $l(w, w')$. We
choose quadratic loss $l(w, w') = (w - w')^2$. The maximum likelihood reconstruction (the one that
minimizes $\hat{L}(Q)$) for the quadratic loss is a delta distribution
$q(w|c_1, c_2) = \delta(w, g(c_1, c_2))$, where
$g(c_1, c_2) = \arg\min_{w'} \sum_{w} l(w, w') \sum_{x_1, x_2} q(c_1|x_1) \hat{p}(x_1, x_2, w) q(c_2|x_2) = \sum_{x_1, x_2, w} q(c_1|x_1) w \hat{p}(x_1, x_2, w) q(c_2|x_2)$.
This enables us to write the prediction model (8) and the loss $\hat{L}(Q)$ in a matrix form. Let
$Q$ be the matrix of $q(c|x)$ with rows indexed by cluster variables and columns indexed by node
variables and $G$ be the matrix of weights predicted in the cluster product space. We denote the
elements of $G$ by $g(c_1, c_2)$. The prediction model (8) can then be written as

$$g(x_1, x_2) = \sum_{c_1, c_2} q(c_1|x_1) g(c_1, c_2) q(c_2|x_2) \tag{13}$$

and the corresponding reconstruction matrix is $Q^T G Q$. Note that $g(x_1, x_2)$ is a function of
$x_1, x_2$, which corresponds to a probability distribution $q(w|x_1, x_2)$, which is a delta
function. The loss can then be written as:

$$\hat{L}(Q) = \frac{1}{N} \| S \circ (W - Q^T G Q) \|_2^2, \tag{14}$$

where $\| \cdot \|_2^2$ is the squared Frobenius norm of a matrix. The maximum likelihood $G$ is
given by $G = Q (S \circ W) Q^T / N$ and the derivative
$\frac{\partial \hat{L}(Q)}{\partial Q} = 4 G^T Q (S \circ (Q^T G Q - W)) / N$.
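The matrix expressions above can be written out with plain nested lists, as in the following sketch. It takes the formulas for the loss, for $G$, and for the derivative as stated in the text; the derivative is taken with $G$ held fixed and, in this sketch, assumes symmetric $S$, $W$, and $G$. All function names and the toy matrices are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def masked_residual(q, g, s, w):
    """S o (Q^T G Q - W), computed entrywise."""
    recon = matmul(matmul(transpose(q), g), q)
    n = len(s)
    return [[s[i][j] * (recon[i][j] - w[i][j]) for j in range(n)]
            for i in range(n)]

def loss(q, g, s, w, n_samples):
    """Empirical quadratic loss: squared Frobenius norm of the residual over N."""
    r = masked_residual(q, g, s, w)
    return sum(v * v for row in r for v in row) / n_samples

def cluster_weights(q, s, w, n_samples):
    """G = Q (S o W) Q^T / N, as stated in the text."""
    n = len(s)
    sw = [[s[i][j] * w[i][j] for j in range(n)] for i in range(n)]
    prod = matmul(matmul(q, sw), transpose(q))
    return [[v / n_samples for v in row] for row in prod]

def loss_gradient(q, g, s, w, n_samples):
    """dL_hat/dQ = 4 G^T Q (S o (Q^T G Q - W)) / N, for symmetric S, W and G."""
    r = masked_residual(q, g, s, w)
    gqr = matmul(matmul(transpose(g), q), r)
    return [[4.0 * v / n_samples for v in row] for row in gqr]

# Hypothetical toy problem: 3 nodes, 2 clusters, 4 observed (directed) entries.
q = [[0.7, 0.4, 0.1], [0.3, 0.6, 0.9]]
s = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
w = [[0.0, 0.9, 0.2], [0.9, 0.0, 0.5], [0.2, 0.5, 0.0]]
n_samples = 4
g = cluster_weights(q, s, w, n_samples)
grad = loss_gradient(q, g, s, w, n_samples)
```

A finite-difference comparison on this toy problem is a quick way to confirm that the closed-form derivative matches the loss when $G$ is held fixed.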

Equation (14) provides an easy way to see why $\hat{L}(Q)$ and hence $\mathcal{G}(Q)$ are not convex
in $Q$: $Q$ appears in the fourth power. Therefore, repeated iteration of alternating projections in
Algorithm 1 is not guaranteed to converge (and indeed it does not). However, we found empirically
that even a single iteration of Algorithm 1 achieves remarkably good results and, due to the
simplicity of the algorithm, it is easy to try multiple random initializations and obtain results
comparable to those obtained by sequential optimization within a much shorter time. This was the
strategy followed in this paper. For a large number of clusters we found it useful to anneal
$\beta$ from a lower value $\beta' = 1/N$ up to the desired value in two-fold increments. At each
value of $\beta$ we iterated alternating projections 5 times and then added a small random noise to
$q(c|x)$ before increasing $\beta$ by a factor of 2 until reaching the desired value.
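The annealing schedule described above can be sketched as follows. The `step` callback stands in for one alternating projection at a given $\beta$ and is hypothetical, as are the noise magnitude, the use of uniform positive noise, and the choice to finish exactly at the target value of $\beta$; the text does not specify these details.

```python
import random

def beta_schedule(beta_target, n_samples):
    """Values of beta from 1/N up to beta_target in two-fold increments."""
    beta = 1.0 / n_samples
    schedule = []
    while beta < beta_target:
        schedule.append(beta)
        beta *= 2.0
    schedule.append(beta_target)  # finish exactly at the desired value
    return schedule

def perturb(q, noise, rng):
    """Add small positive noise to q[c][x] and renormalize every column."""
    noisy = [[v + noise * rng.random() for v in row] for row in q]
    col_sums = [sum(col) for col in zip(*noisy)]
    return [[v / col_sums[x] for x, v in enumerate(row)] for row in noisy]

def anneal(q, step, beta_target, n_samples, n_iters=5, noise=1e-3, seed=0):
    """Run n_iters projections per beta value, adding noise between values."""
    rng = random.Random(seed)
    for beta in beta_schedule(beta_target, n_samples):
        for _ in range(n_iters):
            q = step(q, beta)
        if beta < beta_target:
            q = perturb(q, noise, rng)
    return q

# Usage with a placeholder step that leaves q unchanged.
q_start = [[0.5, 0.5], [0.5, 0.5]]
q_end = anneal(q_start, lambda q, beta: q, beta_target=1.0, n_samples=8)
```

With $N = 8$ and a target of 1 the schedule is 1/8, 1/4, 1/2, 1, so the number of annealing stages grows only logarithmically with $\beta N$.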



5.1 Correction for Edge Weight Quantization 

We note that the alternating projections algorithm derived above operates with continuous weights
$w$, whereas the analysis in theorem 2 allows only a finite set of edge weights. If the edge weights
are uniformly quantized at intervals $\Delta$, then $|\mathcal{W}| = \frac{1}{\Delta}$ (assume that
the quantization starts at $\frac{1}{2}\Delta$ and ends at $1 - \frac{1}{2}\Delta$). By rounding the
continuous edge weights obtained by the alternating projections toward the closest quantization
level, both the empirical and the expected loss are increased by at most
$\Delta + \frac{1}{4}\Delta^2$. This is because quantization can shift the prediction by at most
$\frac{1}{2}\Delta$ and then
$l(w, w' + \frac{1}{2}\Delta) = (w - w' - \frac{1}{2}\Delta)^2 = (w - w')^2 - (w - w')\Delta + \frac{1}{4}\Delta^2 \le l(w, w') + \Delta + \frac{1}{4}\Delta^2$,
where the last inequality follows from the assumption that the loss $l(w, w')$ is bounded by 1.
Hence, for the continuous weights we have

$$L(Q) \le kl^{-1}\left( \hat{L}(Q) + \Delta + \frac{\Delta^2}{4}, \frac{|\mathcal{X}| I(X; C) + |C| \ln |\mathcal{X}| - |C|^2 \ln \Delta + \frac{1}{2} \ln(4N) - \ln \delta}{N} \right) + \Delta + \frac{\Delta^2}{4}. \tag{15}$$

As a rule of thumb we have taken $\Delta = 5|C|^2/N$, so that the contribution of $\Delta$ to the
two operands of the inverse KL-divergence is approximately equivalent. In general this correction
for quantization had no significant influence on the bound.
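The effect of quantization on the quadratic loss can be checked numerically. The sketch below rounds predictions to the levels $\frac{1}{2}\Delta, \frac{3}{2}\Delta, .., 1 - \frac{1}{2}\Delta$ and measures the largest increase in loss over a grid; it assumes, for simplicity, that $1/\Delta$ is an integer, which the rule of thumb does not guarantee. Function names and the grid size are illustrative.

```python
def rule_of_thumb_delta(n_clusters, n_samples):
    """Quantization step Delta = 5 |C|^2 / N from the text."""
    return 5.0 * n_clusters ** 2 / n_samples

def quantize(w, delta):
    """Round w in [0, 1] to the nearest level delta/2, 3*delta/2, .., 1 - delta/2.

    Assumes (for this sketch) that 1/delta is an integer.
    """
    n_levels = round(1.0 / delta)              # |W| = 1 / Delta
    index = min(int(w / delta), n_levels - 1)  # cell of width Delta containing w
    return (index + 0.5) * delta

def worst_case_excess(delta, grid=101):
    """Largest observed increase of the quadratic loss caused by quantizing w'."""
    worst = 0.0
    for a in range(grid):
        for b in range(grid):
            w, w_pred = a / (grid - 1), b / (grid - 1)
            excess = (w - quantize(w_pred, delta)) ** 2 - (w - w_pred) ** 2
            worst = max(worst, excess)
    return worst

delta = 0.1  # hypothetical step, giving |W| = 10 levels
excess = worst_case_excess(delta)
king_delta = rule_of_thumb_delta(41, 103_866)  # sizes of the first king experiment
```

The observed excess stays below $\Delta + \frac{1}{4}\Delta^2$, and for the sizes of the first king experiment the rule of thumb gives a step of roughly 0.08.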



6 Applications 

We evaluate the bound derived in section 3 and the algorithm for its minimization from section 5 on
two real-life datasets used previously in (Yom-Tov and Slonim, 2009). The first dataset, named
"king", was taken from (Gummadi et al., 2002). The graph represents a set of 1,740 DNS servers and
the edge weights correspond to similarities between the servers. The similarities are negative
exponents of the latencies between the servers scaled by dividing by the median value of all
latencies in the data. The second dataset contained the graph of all known pairwise interactions
among 5,202 Yeast proteins, downloaded on February 15, 2008 from the BioGRID web site.$^5$ The edge
weights were set to be 1 between interacting proteins and 0 otherwise.
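One reading of the similarity construction for the king dataset is $w = \exp(-\text{latency} / \text{median latency})$. This interpretation, the function name, and the latency values below are assumptions made for illustration, not details confirmed by the text.

```python
import math
import statistics

def latencies_to_similarities(latencies):
    """Map pairwise latencies to similarities exp(-latency / median latency)."""
    median = statistics.median(latencies.values())
    return {pair: math.exp(-lat / median) for pair, lat in latencies.items()}

# Hypothetical latencies in milliseconds between three servers.
latencies = {(0, 1): 20.0, (0, 2): 80.0, (1, 2): 40.0}
weights = latencies_to_similarities(latencies)
```

Under this reading all weights fall in $(0, 1]$, which is convenient because the analysis assumes a loss bounded by 1, and a pair at the median latency receives weight $e^{-1}$.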



King Dataset Experiments 

In the first experiment we apply five random splits of the king dataset into train,
cross-validation, and test subsets. The train set size is 103,866 edge weights, the
cross-validation set size is 25,967 edges, and the test set consists of the remaining 1,383,097
edges. The size of the train set is only 3.4% of all edges or, if compared to the size of the node
space, the number of observed edges is $8|\mathcal{X}| \ln |\mathcal{X}|$. This fraction of observed
edges is even slightly lower than the 5.3% fraction of edges considered in each iteration of the
parallel Iclust algorithm in (Yom-Tov and Slonim, 2009) (the total number of edges considered in all
iterations of parallel Iclust was generally larger). We cluster the graph into 41 clusters, which is
the same number used by Yom-Tov and Slonim (2009), and compare the test loss and the value of bound
(15) as a function of $\beta$. I.e., for each value of $\beta$ we minimize $\mathcal{G}(Q)$ using
the alternating projections algorithm and substitute the resulting $\hat{L}(Q)$ and $I(X; C)$ into
(15) to compute the bound. The result is shown in Figure 1a. The bound is not perfectly tight,
mainly due to the large $|C|^2 \ln |\mathcal{W}|$ term in this case. Nevertheless, the bound is
meaningful and the cross-validation loss almost coincides with the test loss.
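The sample sizes quoted for the first king experiment can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers stated in the text; the observation that the quoted 3.4% matches $N / |\mathcal{X}|^2$ (ordered node pairs) is an inference, not a statement from the text.

```python
import math

def n_pairs(n_nodes):
    """Number of unordered node pairs (undirected edges without self-loops)."""
    return n_nodes * (n_nodes - 1) // 2

n_nodes = 1740
train, validation, test = 103_866, 25_967, 1_383_097

# Train size measured in units of |X| ln |X|.
train_per_node_log = train / (n_nodes * math.log(n_nodes))
# Train size relative to the number of ordered node pairs |X|^2.
fraction_of_ordered_pairs = train / n_nodes ** 2
```

The three subsets add up to the number of unordered node pairs, and the train set corresponds to about 8 observed edges per node per unit of $\ln |\mathcal{X}|$.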

5 http://www.thebiogrid.org/downloads.php

Figure 1: King dataset experiments. (a) Bound (15), cross-validation loss, and test loss as a
function of $\beta$. Error bars indicate one standard deviation. The minimum of the bound is
indicated by the black "*". Cross-validation follows the test loss so closely that the curves
coincide. (b) Model order selection: train loss $\hat{L}(Q)$, information $I(X; C)$, and bound (15)
as a function of $|C|$.

Figure 2: Illustration of the king dataset. (a) Original dataset. (b) Clustering into 7 clusters.

In the second experiment we consider all edges and cluster the dataset into $|C| = 1, 2, .., 15$
clusters. (Due to symmetry every edge in this dataset appears twice, once from node $i$ to $j$ and
another time from node $j$ to $i$, but in our analysis we consider only one copy of each edge.) The
value of $\beta$ in the optimization trade-off $\mathcal{G}(Q)$ was set to 1. In general, a search
for the optimal $\beta$ could improve the results slightly, although, as we can see from the
previous experiment, not considerably, so we omitted the search over $\beta$ in this experiment. The
results are shown in Figure 1b. First, we see that modeling this dataset by clustering is provably
beneficial: the expected loss in predicting the weights of missing edges (were there any) drops from
0.046 when predicting the weight with the global average to 0.02 when using four clusters and
remains roughly at this level when the number of clusters is further increased. To the best of our
knowledge, this is the first time that the benefit of clustering is formally proven and measured
without any assumptions on the distribution that generated the edge weights (except that they were
generated independently from that distribution). In this experiment there is no test set, but we can
see that the bound follows the train loss quite tightly. The mutual information preserved by the
clusters on the node variables saturates at about 1.2-1.5 nats, which corresponds to an effective
complexity of about four clusters. Clustering of the dataset into seven clusters is illustrated in
Figure 2.
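The link between mutual information in nats and an effective number of clusters can be made concrete: a hard partition into $k$ equally sized clusters preserves $\ln k$ nats, so $e^{I}$ gives an equivalent cluster count. This reading of "effective complexity" is an interpretation offered here as a sketch, and the function name is hypothetical.

```python
import math

def effective_clusters(info_nats):
    """Number of equally sized hard clusters that would preserve info_nats."""
    return math.exp(info_nats)

low = effective_clusters(1.2)
high = effective_clusters(1.5)
```

For the saturation range of 1.2 to 1.5 nats this gives between roughly 3.3 and 4.5 equivalent clusters, in line with the figure of about four.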
Figure 3: Yeast dataset experiments. (a) Bound (15) and test loss as a function of $\beta$. The
minimum of the bound is indicated by the black "*". (b) Model order selection: train loss
$\hat{L}(Q)$, information $I(X; C)$, and bound (15) as a function of $|C|$. Note that the bound
scale is on the left hand side.



Yeast Dataset Experiments 

In our first experiment we apply five random splits of the dataset into 445,125 training and
13,082,676 test edges. Training edges constitute only 3.3% of all the edges or
$10|\mathcal{X}| \ln |\mathcal{X}|$ if compared to the number of graph nodes. As previously, the
fraction of observed edges in the train set is slightly lower than the 5.3% considered in each
iteration of the parallel Iclust algorithm in (Yom-Tov and Slonim, 2009). We cluster the graph into
71 clusters, which is the same number as used by Yom-Tov and Slonim (2009). The comparison of test
loss with the value of the bound is presented in Figure 3a. The bound is not perfectly tight, mainly
due to the $|C|^2 \ln |\mathcal{W}|$ term, but is still meaningful.

In our second experiment we consider all edges and cluster the graph into $|C| = 1, .., 10$
clusters. (Symmetric edges from $i$ to $j$ and from $j$ to $i$ were considered only once.) The value
of $\beta$ was set to 256. The results are shown in Figure 3b. As with the king dataset experiment,
the results could be slightly improved by optimizing $\beta$; however, even for the large value of
$\beta$ chosen, the empirical loss $\hat{L}(Q)$ exhibits only a very minor decrease as the number of
clusters grows, hence the results would not change considerably by tuning $\beta$. Due to the lower
number of clusters and the larger training set the bound is much tighter here than in the first
yeast experiment (note that the bound y scale is on the left hand side of the graph). Unlike in the
king experiment, the bound tells us that clustering does not help in modeling this dataset.



7 Discussion 



We have formulated graph clustering as a prediction problem. This formulation enables direct
comparison of graph clustering with any other approach to modeling the graph. By applying
PAC-Bayesian analysis we have shown that graph clustering should optimize a trade-off between
empirical fit of the observed graph and the mutual information that clusters preserve on the graph
nodes. Prior work of Slonim et al. (2005) and Yom-Tov and Slonim (2009) underscores the practical
benefits of such regularization. Our formulation suggests a better founded and more accurate way of
dealing with the finite sample nature of the graph clustering problem and of tuning the trade-off
between model fit and model complexity. It also suggests formal guarantees on the approximation
quality. In particular, such guarantees can be used for optimization of a trade-off between
approximation precision and computational workload in processing of very large datasets. Our
experiments show that the bound is reasonably tight for practical purposes.